Add Workflow Registry Metrics and remove executions running counter #16463
base: develop
Conversation
@@ -1153,7 +1153,6 @@ func (e *Engine) heartbeat(ctx context.Context) {
			return
		case <-ticker.C:
			e.metrics.incrementEngineHeartbeatCounter(ctx)
-			e.metrics.updateTotalWorkflowsGauge(ctx, e.stepUpdatesChMap.len())
I am removing this metric entirely. See the PR description for why.
Hmmm... this still feels like it could be useful @patrickhuie19 -- I can imagine scenarios where the number of executions is just increasing, and not going down such that we'll eventually exhaust the queue. In that case this metric would help us diagnose what's going on.
You mentioned that this relates to deletes of the workflow engine -- can you say more about how? I'm not following entirely
Note that one source of confusion here is that the name of this metric (total workflows running) doesn't correspond to what it's measuring (the number of executions running in an instance of the workflow engine); maybe that's where the confusion comes from?
It seems to be more confusing than helpful. The name is part of it. The metric is updated on a heartbeat, which is async to engine.Start() and engine.Close().
I can imagine scenarios where the number of executions is just increasing, and not going down such that we'll eventually exhaust the queue
Thanks for the example use case. Would we be able to satisfy this by looking at the completed-duration histogram metrics and observing that no executions are finishing?
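To make the timing issue concrete, here is a minimal sketch of the heartbeat pattern in question (names and signatures simplified for illustration; this is not the engine's actual API). The gauge is only sampled when the ticker fires, so its value can lag engine Start()/Close() and workflow deletes by up to one heartbeat cadence.

package enginemetrics

import (
	"context"
	"time"
)

// heartbeat samples a value and records it only when the ticker fires; any
// change between ticks (including an engine shutting down) stays invisible
// until the next beat.
func heartbeat(ctx context.Context, cadence time.Duration, sample func() int64, record func(context.Context, int64)) {
	ticker := time.NewTicker(cadence)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			record(ctx, sample())
		}
	}
}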
@@ -1266,7 +1265,7 @@ const (
	defaultQueueSize            = 100000
	defaultNewWorkerTimeout     = 2 * time.Second
	defaultMaxExecutionDuration = 10 * time.Minute
	defaultHeartbeatCadence     = 5 * time.Minute
5 minutes is anemic
// tryEngineCleanup attempts to stop the workflow engine for the given workflow ID. Does nothing if the
// workflow engine is not running.
func (h *eventHandler) tryEngineCleanup(wfID string) error {
	if h.engineRegistry.IsRunning(wfID) {
This had logic based on whether or not the underlying Ready implementation in the embedded services.StateMachine would return an error. The Engine has no subservices, Close() isn't marked as synchronous with Ready in services.StateMachine, and if the Engine wasn't Ready we would do nothing anyway.
If we want to stay flexible about checking Ready status in the future, we could add some kind of sync loop that checks the status and deletes once Ready?
@patrickhuie19 Are we subtly changing behaviour here? We'll return an error now if we can't retrieve the engine from the registry, whereas previously we would have ignored that.
Yes. I'll change it so that engineCleanup errors under the same conditions as before, to keep the scope of this PR smaller. The tradeoff with handling a missing engine instance this gracefully is that it might fly under our radar when engine instances are cleared from the registry before they should be.
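For concreteness, a sketch of the two cleanup behaviours being discussed; the registry interface below is an assumption for illustration only, not the actual engineRegistry API.

package cleanupsketch

import (
	"fmt"
	"io"
)

// registry is a hypothetical slice of the engine registry's surface.
type registry interface {
	IsRunning(wfID string) bool
	Get(wfID string) (io.Closer, error)
}

// tryEngineCleanupSketch keeps the previous contract: a missing engine is not
// treated as an error, there is simply nothing to clean up.
func tryEngineCleanupSketch(r registry, wfID string) error {
	if !r.IsRunning(wfID) {
		return nil
	}
	e, err := r.Get(wfID)
	if err != nil {
		// The stricter variant discussed above would return this error instead;
		// swallowing it keeps the PR small but can hide engines that were
		// cleared from the registry earlier than expected.
		return nil
	}
	if err := e.Close(); err != nil {
		return fmt.Errorf("failed to close workflow engine: %w", err)
	}
	return nil
}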
AER Report: CI Core
Jobs: aer_workflow, Clean Go Tidy & Generate, Detect Changes, Scheduled Run Frequency, Core Tests (go_core_tests), Core Tests (go_core_tests_integration), Core Tests (go_core_ccip_deployment_tests), GolangCI Lint (.), Core Tests (go_core_fuzz), Core Tests (go_core_race_tests), test-scripts, lint, SonarQube Scan
1. GolangCI Lint job failed due to a non-wrapping format verb in fmt.Errorf.
AER Report: Operator UI CI ran successfully ✅
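For reference, a "non-wrapping format verb" finding from the linter usually means an error was formatted with %v (or %s) instead of %w, which breaks errors.Is/errors.As on the returned error. The report above does not show the offending line, so the snippet below is only illustrative.

package errwrap

import (
	"errors"
	"fmt"
)

var errEngineClosed = errors.New("engine already closed")

// nonWrapping is the pattern the linter flags: %v flattens the error into a
// string, so errors.Is(nonWrapping(), errEngineClosed) returns false.
func nonWrapping() error {
	return fmt.Errorf("failed to close workflow engine: %v", errEngineClosed)
}

// wrapping uses %w, which preserves the error chain for errors.Is/errors.As.
func wrapping() error {
	return fmt.Errorf("failed to close workflow engine: %w", errEngineClosed)
}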
Force-pushed from 86e8af2 to b7c5a58.
Maybe I'm too tired but I'm really struggling to make sense of the description of this PR.
What I see going on:
- You are removing "workflowsRunningGauge" because it is pretty useless. It's counting the current number of executions at a given point in time, updated only with the heartbeat, correct?
- You are adding a bunch of new counters to track all events in the Workflow Registry. Those make sense to me, especially "totalWorkflowsGauge". It won't cover workflows added outside the Registry but I guess that's fine.
What I don't understand is how this is going to help debug why metrics are lingering after workflow deletion. You say that other metrics hang around too, not only the executions gauge. Are you able to reproduce it locally? What new signal are you expecting to get once we start running this in prod?
Thanks, simplified the description. Your points 1 and 2 are correct. There were multiple motivations behind the changes in this PR, and they are worthwhile on their own outside the context of the time-series-persisting-after-deletion bug we're tracking down.
Thanks!
metrics.incrementRegisterCounter(ctx)

// intentionally without workflow specific labels
h.metrics.updateTotalWorkflowsGauge(ctx, int64(h.engineRegistry.Size()))
Yes, but slightly misleading: this really only measures workflows that have been started by the syncer -- we should add an increment in the workflows/delegate to give an accurate picture.
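One illustrative way to act on that suggestion (the type and helper below are hypothetical, not code from this PR) is to route every engine start, whether it comes from the syncer or from the workflows delegate, through a single place that touches the gauge:

package metricsketch

import "context"

// engineCounters is a hypothetical wrapper over the existing metric helpers.
type engineCounters struct {
	incrementRegisterCounter  func(context.Context)
	updateTotalWorkflowsGauge func(context.Context, int64)
}

// onEngineStarted is called from every start path so the gauge reflects all
// running engines, not just those started by the syncer.
func (c *engineCounters) onEngineStarted(ctx context.Context, registrySize int) {
	c.incrementRegisterCounter(ctx)
	// intentionally without workflow-specific labels, as in the excerpt above
	c.updateTotalWorkflowsGauge(ctx, int64(registrySize))
}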
m, err := initMonitoringResources()
if err != nil {
	lggr.Fatalw("Failed to initialize monitoring resources", "err", err)
hmm. this will kill the node. that seems unreasonable, we don't know what else is running.
where did this init call live before?
if it really should block the whole node, then panic
if not, then critical
For this particular error to be triggered would signal a potentially catastrophic regression in the Beholder impl. But agreed, we don't know what else is running.
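A minimal sketch of the alternative being suggested, assuming the logger in use exposes a Criticalw method (the surrounding types are stand-ins, not the node's actual code): log at critical severity and return the error to the caller instead of Fatalw-ing, which takes down the whole node.

package monitoringsketch

import (
	"errors"
	"fmt"
)

// criticalLogger is a stand-in for the node logger's critical-severity method.
type criticalLogger interface {
	Criticalw(msg string, keysAndValues ...interface{})
}

type monitoringResources struct{}

// initMonitoringResources is a placeholder that always fails, standing in for
// the real Beholder-backed initialization.
func initMonitoringResources() (*monitoringResources, error) {
	return nil, errors.New("beholder meter not configured")
}

// setupMonitoring surfaces the failure instead of killing the process.
func setupMonitoring(lggr criticalLogger) (*monitoringResources, error) {
	m, err := initMonitoringResources()
	if err != nil {
		lggr.Criticalw("Failed to initialize monitoring resources", "err", err)
		return nil, fmt.Errorf("failed to initialize monitoring resources: %w", err)
	}
	return m, nil
}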
…ound to make that possible
The executions running gauge can be misleading since it is updated on heartbeat, which is async to workflow deletes through the workflow registry.
I saw we were lacking some metrics in the workflow registry, so I added coverage.
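As a rough sketch of the kind of coverage this adds (instrument names and the OpenTelemetry usage below are assumptions for illustration, not the PR's actual metric definitions):

package registrymetrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// registryMetrics holds illustrative instruments for workflow registry events.
type registryMetrics struct {
	registerCounter metric.Int64Counter
	totalWorkflows  metric.Int64UpDownCounter // stand-in for the total-workflows gauge
}

func newRegistryMetrics() (*registryMetrics, error) {
	meter := otel.Meter("workflow_registry")
	registerCounter, err := meter.Int64Counter("workflow_registry_register_events")
	if err != nil {
		return nil, err
	}
	totalWorkflows, err := meter.Int64UpDownCounter("workflow_registry_total_workflows")
	if err != nil {
		return nil, err
	}
	return &registryMetrics{registerCounter: registerCounter, totalWorkflows: totalWorkflows}, nil
}

// incrementRegisterCounter records one register event from the registry.
func (m *registryMetrics) incrementRegisterCounter(ctx context.Context) {
	m.registerCounter.Add(ctx, 1)
}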